Theory of Deep Learning III: the non-overfitting puzzle
Authors
Abstract
A main puzzle of deep networks revolves around the apparent absence of overfitting, understood as robustness of the expected error to overparametrization, despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamics associated with gradient descent minimization of nonlinear networks is topologically equivalent, near the asymptotically stable minima of the empirical error, to a gradient system in a quadratic potential with a degenerate (for the square loss) or almost degenerate (for the logistic or cross-entropy loss) Hessian. The proposition rests on the qualitative theory of dynamical systems and is supported by numerical results. The result extends to deep nonlinear networks two key properties of gradient descent for linear networks that have recently been recognized (1) to provide a form of implicit regularization:

1. For classification, which is the main application of today's deep networks, minimization of loss functions such as the logistic, cross-entropy and exponential losses converges asymptotically to the maximum margin solution. The maximum margin solution guarantees good classification error for "low noise" datasets. Importantly, this property holds independently of the initial conditions. Because of this property, our proposition guarantees a maximum margin solution also for deep nonlinear networks.

2. Gradient descent enforces a form of implicit regularization controlled by the number of iterations and, for appropriate initial conditions, converges asymptotically to the minimum norm solution. This implies that there is usually an optimal early stopping point that avoids overfitting of the expected risk. This property, valid for the square loss and many other loss functions, is relevant especially for regression. In the case of deep nonlinear networks, however, the solution is not expected to be strictly minimum norm, unlike in the linear case.

The robustness to overparametrization has suggestive implications for the robustness of the architecture of deep convolutional networks with respect to the curse of dimensionality.
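As a rough numerical illustration of point 2 above (not code from the paper), the sketch below runs plain gradient descent on the square loss for an overparametrized linear model, starting from zero initialization: the iterates drive the training error to zero and converge to the minimum-norm interpolant given by the pseudo-inverse. The problem sizes, learning rate, and iteration count are arbitrary choices for the demonstration.

```python
# Minimal sketch (illustrative, not the paper's code): for an overparametrized
# *linear* model with square loss, gradient descent started at w = 0 converges
# to the minimum-norm interpolating solution, i.e. the pseudo-inverse solution.
# n, d, lr, and steps are arbitrary illustrative choices.
import numpy as np

rng = np.random.default_rng(0)
n, d = 20, 100                          # fewer samples than parameters
X = rng.standard_normal((n, d))
y = rng.standard_normal(n)

w = np.zeros(d)                         # zero initialization matters for the minimum-norm limit
lr, steps = 1e-2, 100_000
for _ in range(steps):
    w -= lr * X.T @ (X @ w - y) / n     # gradient of (1/2n) * ||X w - y||^2

w_pinv = np.linalg.pinv(X) @ y          # minimum-norm interpolant
print("training residual:", np.linalg.norm(X @ w - y))                # ~ 0: zero training error
print("distance to min-norm solution:", np.linalg.norm(w - w_pinv))   # ~ 0: GD found the min-norm solution
```

Stopping the same iteration earlier yields solutions of smaller norm, which is the sense in which the number of iterations acts as an implicit regularization parameter in this linear setting.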
Similar resources
Comparison of Lecture and Puzzle for Teaching Medical Emergency to Anesthesiology Students: Students’ Learning and Viewpoints
Introduction: Emphasis on active learning leads to the development of new educational strategies for teaching theoretical and clinical courses in the medical sciences. Active involvement in the teaching-learning process improves learning. The purpose of the present study is to compare the lecture and puzzle methods in the Medical Emergency Course, regarding students’ learning & viewpoi...
Theory of Deep Learning III: explaining the non-overfitting puzzle
A main puzzle of deep networks revolves around the absence of overfitting despite large overparametrization and despite the large capacity demonstrated by zero training error on randomly labeled data. In this note, we show that the dynamics associated to gradient descent minimization of nonlinear networks is topologically equivalent, near the asymptotically stable minima of the empirical error,...
Theory of Deep Learning III: Generalization Properties of SGD
In Theory III we characterize with a mix of theory and experiments the consistency and generalization properties of deep convolutional networks trained with Stochastic Gradient Descent in classification tasks. A present perceived puzzle is that deep networks show good predictive performance when overparametrization relative to the number of training data suggests overfitting. We describe an exp...
Musings on Deep Learning: Properties of SGD
We ruminate with a mix of theory and experiments on the optimization and generalization properties of deep convolutional networks trained with Stochastic Gradient Descent in classification tasks. A present perceived puzzle is that deep networks show good predictive performance when overparametrization relative to the number of training data suggests overfitting. We dream an explanation of these...
The effect of Electronic puzzle Games on Improving Reading Performance of Students with learning disorders
The purpose of this study was to investigate the effect of electronic puzzle games on improving the reading performance of students with learning disorders in Gorgan. The research was applied in purpose and quasi-experimental in method, with a pre-test, post-test design and a control group. The statistical population consisted of 255 female students of Shokufa and Dena provincial centers and non-profit cen...
Publication date: 2018